Conversion from Data-flow to Synchronous Execution
in Loop Programs

Janice E. Cuny
Lawrence  Snyder 

Purdue  University 

The preparation of highly parallel programs is not yet a routine programming
activity. When we compare it to sequential programming
where  there  are  numerous  general  problem  solving  techniques,  extensive 
programming  language  and  system  support,  and  a  large  corpus  of 
thoroughly  analyzed  and  tested  algorithms  and  data  structures,  parallel 
programming  is  presently  at  a  very  primitive  stage  of  development. 

One difficulty, of course, is synchronization: making sure that the
right processor processes the right data at the right time. The synchronization
problem can apparently be simplified by use of a data-driven or
data-flow  based  execution  mode.  In  this  mode,  each  processor  idles  in  a 
busy-wait  loop  until  data  values  have  arrived  from  all  of  its  input 
sources;  it  then  computes  and  writes  results  out  to  other  processors. 
Parallel  programming  is  simplified  because  much  of  the  synchronization 
is  accomplished  implicitly  by  the  underlying  machine. 

The data-flow execution mode does not eliminate synchronization as
a problem of parallel computation; it only eliminates it as a problem for
the programmer. The underlying hardware must still service the arrival
of data (asynchronously), determine when sufficient data has arrived to
initiate  processing,  support  queues  for  all  of  the  input  channels  to  hold 
the  arriving  data,  and  implement  a  "queue  is  full"  signalling  mechanism 
with  the  input  data  queues.  These  hardware  facilities  represent 
significant  overhead  and  are  incompatible  with  current  efforts  in  the 
design  of  VLSI  multiprocessors  toward  very  simple  processor  structure. 

In this paper, we consider the automatic conversion of data-flow programs
into equivalent synchronous programs. Such conversions enable
programmers to program as though the underlying machine executed in
a data-flow mode, while allowing the hardware to execute synchronously.
We begin with a model of parallel computation in which we can express
both data-flow and synchronous computations. Within this model, we
define a restricted class of programs and characterize the conditions
under which a conversion from data-flow to synchronous execution is possible.
Finally, we present two algorithms for performing the conversion:
the first is more general but the second often produces better results.
Although our algorithms apply only to a subclass of all parallel programs,
that subclass is sufficiently rich to encompass many of the recently developed
parallel and systolic programs.

The  Model  of  Parallel  Programs 

The  formalism  that  we  use  to  develop  our  algorithms  and  prove  their 
correctness is quite spare. In order to connect it with conventional parallel
computation settings, we give an informal description of the situation
from  which  we  have  abstracted. 

We postulate a parallel processor composed of $m$ machines
$M_1, M_2, \ldots, M_m$ which communicate with read and write operations. The
machines, referred to as processing elements or PEs, are all of the same
type. In general, PEs will be sequential RAMs with small amounts of local
memory (and no global memory) but it is sufficient to let them be devices
capable of defining a regular set. This simplification is valid because we
are concerned here only with a PE's interprocess input/output behavior
and not its computational ability. We assume that the machines execute
with a common time step; on each step a PE can attempt to perform a
set of operations simultaneously. In synchronous mode, all operations will
execute  the  first  time  that  they  are  attempted.  In  data-flow  mode,  writes 
will  execute  as  soon  as  they  are  attempted  but,  depending  on  the  state, 
reads  may  block.  A  blocked  operation  is  retried  on  the  next  execution 
step  and  a  process  does  not  proceed  with  a  new  set  of  operations  until  all 
of  its  current  operations  have  completed. 

We model such systems as Interprocess Communication (IC) Systems.
An IC system is completely defined by a set of regular expressions,
$V_1, V_2, \ldots, V_m$, each describing the interprocess input/output behavior of a
single PE. The i-th regular expression describes the behavior of the i-th
machine. The algorithms developed in this paper work for loop programs
in which all regular expressions are of the form $\alpha^*$ where $\alpha$ is a sequence
of symbols from the alphabet. We define $\rho$ to be a function on expressions
that removes the outermost Kleene star; $\rho(\alpha^*) = \alpha$. The symbols in
our regular expressions denote sets of operations that are to be executed
simultaneously. The alphabet is the power set of $\{ r_j, w_j \mid j \in [m] \}$,† where $r_j$
denotes a read from PE j, $w_j$ denotes a write to PE j and $\{\}$ takes the
place of any operation not involved in interprocess communication
(including operations that transfer values to and from the external
environment).‡ Figure 1(a) is an IC system representing the systolic
processor for band matrix-vector multiplication with a bandwidth of four
[1]; only interprocess reads and writes appear in the model, and all other
operations are replaced by $\{\}$. Figure 1(b) shows the communication
graph for this system; each vertex represents a PE and a directed edge
from node i to node j represents a communication link over which the
i-th PE writes to the j-th PE and the j-th PE reads from the i-th PE.

† $[m]$ denotes the set $\{1, 2, 3, \ldots, m\}$.

Figure 1. (a) IC system representing the systolic processor for band
matrix-vector multiplication. (b) Communication graph for the IC system
of Figure 1(a).

We define the execution of an IC system in terms of two sequences,
$C^1, C^2, C^3, \ldots$ and $T^0, T^1, T^2, \ldots$. Each element of the first sequence is an
m-vector which gives the program counter values for all PEs (a program
counter value is the index of a set of operations). Each element of the
second sequence is an $m \times m$ matrix of strings, giving the status of
communications in terms of a generic message $X$. The status of communications
on the link from PE i to PE j is given by $t_{ij}$: $t_{ij} = X^n$ means that
there have been n unanswered writes; $t_{ij} = (X^{-1})^n$ means that there have
been n unanswered reads; and $t_{ij} = \lambda$ means that there are no outstanding
reads or writes ($\lambda$ represents the null string). The sequences together
describe the execution of a system; for all $k > 0$, $C^k$ describes the set of
operations that will be attempted on the k-th execution step and $T^k$
describes the status of communications if all of those operations complete.

‡ Note that we use standard set notation to represent both sets and the symbols
of our alphabet; the distinction will be clear from the surrounding context. In our
figures, we will use rectangular boxes to enclose sets rather than the usual brace
notation.

To start the sequences, we define $c_i^1 = 1$ for all $i \in [m]$ and $t_{ij}^0 = \lambda$ for all
$i, j \in [m]$; $C^1$ shows all PEs executing their first set of operations and $T^0$
shows that there are no outstanding reads or writes. The remainder of
the sequence of $C$s is defined to reflect the fact that a PE moves to a new
set of operations only if all operations in its previous set have completed:

$$c_i^{k+1} = \begin{cases} c_i^k + 1 & \text{if } UNBLOCKED(i, V_i(c_i^k), T^k) \\ c_i^k & \text{otherwise} \end{cases}$$

where the notation $V(j)$ denotes the j-th symbol in some word generated
by the expression $V$ * and $UNBLOCKED(i, S, T)$ is true if the i-th PE can
execute all operations in set $S$ when the status of communications is
described by $T$. The exact form of UNBLOCKED depends on the mode of
execution, synchronous or data-flow, and is discussed below. The
remainder of the sequence of $T$s is defined to reflect the execution of
read and write operations:

$$t_{ij}^{k+1} = a \, t_{ij}^k \, b$$

where

$$a = \begin{cases} X & \text{if } w_j \in V_i(c_i^{k+1}) \wedge (k = 0 \vee c_i^{k+1} \neq c_i^k) \\ \lambda & \text{otherwise} \end{cases}$$

and

$$b = \begin{cases} X^{-1} & \text{if } r_i \in V_j(c_j^{k+1}) \wedge (k = 0 \vee c_j^{k+1} \neq c_j^k) \\ \lambda & \text{otherwise} \end{cases}$$

and $X X^{-1} = \lambda$. We observe that our execution rules are more general and
more realistic than those used in many models because we do not insist
that all of the operations in a set execute simultaneously. Depending on
the definition of UNBLOCKED, it is possible, for example, to allow independent
reading and writing on different ports.

* Note that for all loop programs, $V(j)$ is a unique symbol. This notation will also
be used for processes that execute an initialization sequence before entering
their loop. These PEs are represented by regular expressions of the form $\beta\alpha^*$
where $\alpha$ and $\beta$ are sequences over the alphabet and, again, the j-th symbol is
unique.

The execution of an IC system is parameterized by the predicate
UNBLOCKED. When the predicate is TRUE, the IC system is synchronous,
that is, all operations execute on every time step. A correct, synchronous
system should have the property that corresponding reads and
writes are simultaneous.† More precisely, if during synchronous execution,
$t_{ij}^k = \lambda$ for all i, j and k, we say that the system is strongly coordinated.
When the predicate $UNBLOCKED(i, S, T)$ is

$$\forall j \in [m] \; ( r_j \in S \Rightarrow t_{ji} \in X^* )$$

the IC system is data flow, that is, read operations execute only when
values are present. A correct, data-flow program should have the property
that none of the individual PEs deadlock. We say that a system is
valid if

$$\forall i \in [m] \; \forall k \geq 0 \; \exists l > k \; ( c_i^l \neq c_i^k )$$

when the system is executed in data-flow mode.

† It is more common to assume that a read executes immediately after its
corresponding write. We have chosen simultaneous reads and writes to be consistent
with VLSI technology and to simplify our discussion. All of our algorithms
can be easily modified to incorporate any fixed delay for message transmission.
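
The execution rules above can be illustrated with a small simulator. This is a minimal sketch, not part of the original paper: the names `run` and `strongly_coordinated_prefix` are hypothetical, PEs are numbered from 0, each operation is a pair `("r", j)` or `("w", j)`, and the string $t_{ij}$ is represented by an integer (a positive value for unanswered writes, a negative value for unanswered reads, 0 for $\lambda$). It can only inspect a finite prefix of an execution.

```python
def run(programs, steps, dataflow):
    """Return the matrices T^1..T^steps for a loop program.

    programs[i] is one cycle of PE i: a list of sets of operations.
    """
    m = len(programs)
    pc = [0] * m            # program counters (0-based)
    fresh = [True] * m      # True if the current set is attempted for the first time
    t = [[0] * m for _ in range(m)]
    history = []
    for _ in range(steps):
        # Newly attempted operations update the link status.
        for i in range(m):
            if fresh[i]:
                for kind, j in programs[i][pc[i] % len(programs[i])]:
                    if kind == "w":
                        t[i][j] += 1    # append X to t_ij
                    else:
                        t[j][i] -= 1    # append X^-1 to t_ji
        history.append([row[:] for row in t])
        # A PE advances only if UNBLOCKED holds for its current set.
        for i in range(m):
            ops = programs[i][pc[i] % len(programs[i])]
            blocked = dataflow and any(
                kind == "r" and t[j][i] < 0 for kind, j in ops
            )
            fresh[i] = not blocked
            if fresh[i]:
                pc[i] += 1
    return history


def strongly_coordinated_prefix(programs, steps):
    """Check t_ij = lambda on every link for the first `steps` synchronous steps."""
    return all(
        v == 0 for T in run(programs, steps, dataflow=False) for row in T for v in row
    )
```

For example, a writer paired with a reader on the same step passes the check, while a reader that lags one step behind its writer does not.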

We  remark  that  the  model  developed  here  differs  from  the  well- 
known  vector  addition  system  model  [2]  and  the  Petri  Net  model  [3].  In 
the  VAS  model,  there  is  a  specific  execution  mode:  transition  vectors  are 
applied  only  if  all  relevant  coordinates  are  positive  and  when  a  transition 
vector  is  applied,  all  coordinates  are  updated  simultaneously.  There  is 
also a specific execution mode for Petri Nets: transitions fire only if all
incident places contain a token and all token values are updated simultaneously.
In contrast, IC systems may execute in either synchronous or
data-flow  mode.  In  synchronous  mode,  operations  execute  as  soon  as 
they  are  attempted.  In  data-flow  mode,  execution  is  conditional  on  the 
appropriate  values  being  available  as  in  the  VAS  and  Petri  Net  models. 
However,  even  in  data-flow  mode,  our  model  differs  from  the  other  two 
since  operations  execute  whenever  they  are  enabled  and  the  input  and 
output  of  an  instruction  are  not  necessarily  simultaneous. 

Variants 

We would like to convert data-flow programs into strongly coordinated,
synchronous programs. For such algorithms to be useful, the
resulting program must perform the same computation as the original
program. To make this more precise, we define the notion of the set of
reads preceding a specific write. Writes, in data-flow mode, execute on
the first step in which they are attempted; the set of writes executed by
PE i in execution step k, $WRITES(i, k)$, is

$$\{ w_j \mid w_j \in V_i(c_i^k) \wedge (k = 1 \vee c_i^k \neq c_i^{k-1}) \} .$$

Reads, in data-flow mode, may block temporarily and so a read executes
in the first step in which it was attempted and the corresponding data
was available; the set of reads that PE i executes in step k, $READS(i, k)$, is

$$\{ r_j \mid r_j \in V_i(c_i^k) \wedge (t_{ji}^k \neq X^{-1}) \wedge ( (t_{ji}^{k-1} = X^{-1}) \vee ( (k = 1 \vee c_i^k \neq c_i^{k-1}) \wedge (t_{ji}^k = t_{ji}^{k-1} X^{-1} \vee w_i \in WRITES(j, k)) ) ) \} .$$

This means that a read in the current operation set executes on step k if
it is no longer pending after k ($t_{ji}^k \neq X^{-1}$) and one of three conditions is
met: it had been pending in the previous step ($t_{ji}^{k-1} = X^{-1}$); or it was first
attempted in step k ($k = 1 \vee c_i^k \neq c_i^{k-1}$) and there were unanswered writes
available ($t_{ji}^k = t_{ji}^{k-1} X^{-1}$); or it was first attempted in step k and a
corresponding write also occurred in step k ($w_i \in WRITES(j, k)$).

The l-th write from PE i to PE j occurs on the first execution step k such that

$$l = \sum_{p=1}^{k} x_p \quad \text{where} \quad x_p = \begin{cases} 1 & \text{if } w_j \in WRITES(i, p) \\ 0 & \text{otherwise} \end{cases}$$

and the set of reads that precede that l-th write, $PREADS(l, i, j)$, is the
multi-set $\bigcup_{p=1}^{k} READS(i, p)$. From this, we can define the relationship that
we wish to hold between the original data-flow system and our constructed,
synchronous system.

In terms of our abstraction, we will say the constructed system P'
performs the same computation as the original system P if three requirements
are met. The first requirement is that a PE communicates with
the same set of PEs in both systems. Our second requirement is that
there is at least as much data available to a PE at the time of any write in
P' as there was available in P. This second requirement will be true if the
set of reads that precede any write in P is a subset of the set of reads
that precede that same write in P'. Thus, we allow reads to occur "earlier"
and writes "later" in the constructed system than they did in the
original system; we assume that resulting, additional data is buffered
within the PE. To ensure that the PEs remain finite, our third requirement
is that the amount of this additional buffering is bounded. Putting
this together, we say the new system P' is a variant of the original system
P if

(i) they have the same communication graphs

(ii) for each pair of PEs i and j and for all $l > 0$

$$PREADS(l, i, j) \subseteq PREADS'(l, i, j)$$

and

(iii) there is some b such that for each pair of PEs i and j and for all
$l > 0$

$$| PREADS'(l, i, j) - PREADS(l, i, j) | \leq b .$$

We present the following propositions without proof.


Proposition 1: The relation "variant of" is transitive.

Proposition 2: If $P = V_1, V_2, \ldots, V_m$ is a valid, loop program and $n_1, n_2, \ldots, n_m$
are integers greater than 0, then

$$(\rho(V_1)^{n_1})^*, (\rho(V_2)^{n_2})^*, \ldots, (\rho(V_m)^{n_m})^*$$

is a variant of P.


The problem that we consider in the remainder of this paper can now
be formally stated:

Given a valid, data-flow loop program, construct a strongly coordinated
variant.

The Coordination Problem

The coordination problem cannot be solved for all data-flow, loop programs.
Consider, for example, the system in Figure 2. We define the
rate at which a PE uses a communication link to be the number of reads
or writes by that PE to the link in one cycle of its execution. The PEs in
the example communicate across the link from B to A at the same rate
but they communicate across the link from A to B at different rates.
Intuitively, to strongly coordinate this system, the cycles of A must
"speed up" relative to the cycles of B. Any speed up of A, however,
causes the communication rates across the link from B to A to differ.
This new mismatch can only be corrected by speeding up the cycles of B
relative to the cycles of A, returning us to the original problem. There is
no strongly coordinated variant of the system in Figure 2.

Figure 2. An IC system that has no balanced variant

The problem
with the system is not simply a matter of unmatched data rates: the data
rates  across  the  link  of  the  system  in  Figure  3(a)  are  also  unmatched  but 
the  system  has  a  strongly  coordinated  variant  shown  in  Figure  3(b).  The 
distinction  between  systems  that  can  be  coordinated  and  systems  that 
cannot  be  coordinated  is  more  subtle. 


Defining $ON(i, j)$ to be the number of writes by PE i to PE j in $V_i$ and
$OFF(i, j)$ to be the number of reads by PE j from PE i in $V_j$, we say that a
system is balanced if the following set of balancing equations has a solution
in which all $x_i = 1$:

$$\{ ON(i, j) \, x_i = OFF(i, j) \, x_j \mid i, j \in [m] \} .$$

Neither the system in Figure 2 nor the system in Figure 3(a) is balanced;
the system in Figure 3(b) is balanced.

Figure 3.


Lemma 1: There is at most one independent, non-trivial solution to a set
of balancing equations.

Proof: Form a spanning tree T of the undirected graph underlying the
communication graph for the given system. Each vertex of T
corresponds to a variable in the balancing equations and each edge of T
corresponds to one of the equations. It is sufficient to show that the
$(m - 1)$ corresponding equations are independent. Consider a variable
$x_i$ corresponding to a leaf node of T. There is exactly one edge e to the
node corresponding to $x_i$ and so there is one equation represented by T
that uses $x_i$. That equation must be independent of the other $(m - 2)$ equations
represented by $T - \{e\}$ and by induction those equations must be
independent of each other. //

A loop program is said to be balancable if its balancing equations have a
solution in which all of the $x_i$ are integers greater than 0. The system in
Figure 2 is not balancable because there is no nontrivial solution to the
set of equations

$$2 x_A = 3 x_B$$
$$x_B = x_A .$$

The system in Figure 3(a) is balancable because $x_1 = 1$ and $x_2 = 2$ is a solution
to the equations. If a loop program $V_1, V_2, \ldots, V_m$ is balancable then a
solution to the balancing equations $x_1, x_2, \ldots, x_m$ can be found in $O(n^3)$ time
and by Proposition 2, we can construct a balanced variant, $V_1', V_2', \ldots, V_m'$
by setting each $V_i' = (\rho(V_i)^{x_i})^*$.
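
As an illustration of how the balancing equations might be solved in practice, the following sketch propagates the ratio $x_j = (ON(i,j) / OFF(i,j)) \, x_i$ across the communication graph and checks every link for consistency. It is an assumption-laden example rather than the method used in the paper: the function name `balance` and the dictionary encoding of ON and OFF are hypothetical, PEs are numbered from 0, and `math.lcm` requires Python 3.9 or later.

```python
from collections import deque
from fractions import Fraction
from math import lcm


def balance(m, on, off):
    """Return positive integers x_0..x_{m-1} solving ON(i,j)*x_i = OFF(i,j)*x_j,
    or None if no such solution exists.

    on[(i, j)]  = writes by PE i to PE j per cycle
    off[(i, j)] = reads by PE j from PE i per cycle
    """
    adj = {i: [] for i in range(m)}
    for i, j in set(on) | set(off):
        a, b = on.get((i, j), 0), off.get((i, j), 0)
        if a == 0 or b == 0:
            if a != b:
                return None     # one side never uses the link: no positive solution
            continue
        adj[i].append((j, Fraction(a, b)))   # x_j = (a/b) * x_i
        adj[j].append((i, Fraction(b, a)))   # x_i = (b/a) * x_j
    x = {}
    for start in range(m):
        if start in x:
            continue
        x[start] = Fraction(1)
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v, ratio in adj[u]:
                value = x[u] * ratio
                if v not in x:
                    x[v] = value
                    queue.append(v)
                elif x[v] != value:
                    return None  # two links demand different ratios
    scale = lcm(*(f.denominator for f in x.values()))
    return [int(x[i] * scale) for i in range(m)]
```

With the equations given above for Figure 2 (PE 0 standing for A and PE 1 for B) the function returns `None`. For a hypothetical one-link system in which PE 0 writes twice per cycle and PE 1 reads once, it returns `[1, 2]`.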

We  can  now  state  the  relationship  between  loop  programs  which  can 
be  strongly  coordinated  and  balancable  programs. 

Theorem 1: A valid, loop data-flow program can be strongly coordinated if
and only if it is balancable.

Proof:

(<=) This proof is given later as the proof of our Wave Algorithm.

(=>) Let P be a valid data-flow program and let P' be a strongly coordinated
variant of P. Because P' contains only loop programs, it is possible
to consider the c values for any PE i as integers modulo the length of $\rho(V_i')$.
With this change in program counter values, P' is finite state since there
is no buffering of transmitted values. Therefore there is some state, q,
which appears infinitely often in the execution of P' and the execution
sequences appearing between any two consecutive occurrences of q must
be the same. Consider an arbitrary PE i, and let O be the multi-set of
operations it executes in a single cycle and let E be the multi-set of
operations it executes as the system moves from one occurrence of q to
the next. Let $O^n$ be the n-fold union of O.

Claim: $E = O^n$ for some $n \geq 1$.

Proof: Suppose not. Then let $Y = E - O^r$ where r is the greatest
integer such that $O^r \subseteq E$. Y is the set of "extra" operations that do not
form a complete cycle. Suppose there is some write operation, $w_j$, in
Y. Then Y must contain all of the read operations in O as well, since
otherwise the writes to PE j would "move up" relative to the reads by
PE i and eventually, for some k, we would have

$$PREADS(k, i, j) \not\subseteq PREADS'(k, i, j) .$$

Suppose there is a read, $r_j$, in Y. Then Y must also contain all of the
write operations in O, since otherwise the reads would "move
up" relative to the writes by PE i and, for any bound b there would be
some k for which

$$| PREADS'(k, j, i) - PREADS(k, j, i) | > b .$$

Unless Y is empty, we have a contradiction. This completes the proof
of the claim.

So for each PE, $E = O^n$ for some integer n. Choosing $x_i$ to be the
appropriate value of n for PE i, the $x_i$'s form the desired solution to the
balancing equations. //

The class of programs that can be strongly coordinated is quite large
and it includes, for example, most of the systolic and pipelined algorithms.
As another characterization, an IC system has the finite buffer property
if, when executed in data-flow mode, there is some integer b such that for
all $i, j \in [m]$ and $k \geq 0$, $|t_{ij}^k| \leq b$. This is obviously a desirable characteristic
for any data-flow program and we show



Theorem  2:  Any  valid,  loop  program  with  the  finite  buffer  property  can  be 
strongly  coordinated. 

Proof: If D is a valid, loop program with the finite buffer property, then,
as above, there must be some state which repeats infinitely often in the
execution sequences for D. Between every two consecutive occurrences of
this state in a sequence, a PE must execute an integral number (greater
than 0 because the system is valid) of its cycles and data rates onto and
off of each communication link must be equal. As a result, if we set $x_i$ to
the number of cycles PE i executes during this sequence, then the $x_i$'s
form a solution to the balancing equations in which all $x_i > 0$. //

From this theorem and the example in Figure 3(a), we can conclude

Corollary: The set of valid, loop programs with the finite buffer property
is properly contained in the set of valid, loop programs that can be
strongly coordinated.

In the next section of this paper, we present our algorithms for converting
data-flow programs into strongly coordinated programs. The
algorithms work only for balancable, valid loop programs. We have shown
how to determine whether or not a program is balancable; now we show
how to determine whether or not it is valid.

Theorem 3: If a loop program is balancable, then there is an efficient
method for testing its validity.

Proof: Let S be a balancable loop program and let B be its balanced variant
constructed as above. The words generated by each PE have not been
changed in B, so S is valid if and only if B is valid. If $B = V_1, V_2, \ldots, V_m$,
construct the system $D = \rho(V_1)\{\}^*, \rho(V_2)\{\}^*, \ldots, \rho(V_m)\{\}^*$.

Claim: B is valid if and only if D is valid.

Proof:

(=>) Immediate since B is balanced.

(<=) Suppose that D is valid and B is not valid. Then there must be
some subset of the PEs, $p_1, p_2, \ldots, p_n$, which become circularly blocked;
that is, for all $i \in [n]$, PE i blocks on a read from its successor, that is,
PE $(i \bmod n) + 1$. Consider the first point in the execution sequence at
which this circular blocking occurs. Since B is balanced, the PEs
must all be on the same iteration of their cycles at this point and
within this iteration, the number of writes to PE i by its successor
must be one less than the number of reads by PE i from its successor.
For the circular blocking to arise in B, it must be that for all
$i \in [n]$, the read which blocks in PE i must come before the write that
releases its predecessor. But as a single PE executes in B or D, its
reads retain the same position relative to its writes so this must be
true in D as well. If the blocked reads all precede the releasing
writes in D, however, then D would block on the same operations.
This is a contradiction since D is valid, completing the proof of the
claim.

D can be tested for validity by executing it until it reaches a step k for
which for all i, $c_i^k > |\rho(V_i)| \vee \forall l > k \, (c_i^l = c_i^k)$. Once such a stable state has
been reached, the validity of D can be tested by determining whether or
not all read and write operations have completed. If s is the number of
operations executed by PEs in a single cycle of S, D will execute for at
most s steps and so the test requires $O(s)$ time. //
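
The test described in this proof can be sketched as a single pass over one cycle of each PE. The code below is an illustrative assumption, not the authors' implementation: `is_valid_balanced` is a hypothetical name, PEs are numbered from 0, operations are pairs `("r", j)` or `("w", j)`, and the input is assumed to be one cycle of each PE of an already balanced system (the system D with its trailing no-ops left implicit). The run stops either when every PE has finished its cycle or when no unfinished PE can advance, which is the stable state of a deadlock.

```python
def is_valid_balanced(cycles):
    """Data-flow execution of one cycle per PE; True if every PE completes it."""
    m = len(cycles)
    pc = [0] * m
    fresh = [True] * m      # current set is being attempted for the first time
    t = [[0] * m for _ in range(m)]   # t[i][j] > 0: unanswered writes, < 0: pending reads
    while any(pc[i] < len(cycles[i]) for i in range(m)):
        # Record the operations that are attempted for the first time.
        for i in range(m):
            if fresh[i] and pc[i] < len(cycles[i]):
                for kind, j in cycles[i][pc[i]]:
                    if kind == "w":
                        t[i][j] += 1
                    else:
                        t[j][i] -= 1
        # Advance every unfinished PE whose reads have all been answered.
        advanced = False
        for i in range(m):
            if pc[i] >= len(cycles[i]):
                continue
            blocked = any(kind == "r" and t[j][i] < 0 for kind, j in cycles[i][pc[i]])
            fresh[i] = not blocked
            if fresh[i]:
                pc[i] += 1
                advanced = True
        if not advanced:
            return False    # stable state with unfinished PEs: deadlock
    return True
```

Each pass of the loop either advances some program counter or returns, so the number of passes is bounded by the total number of operation sets in one cycle, in line with the bound stated above.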

The  Conversion  Algorithms 

In this section, we provide algorithms for automatically converting a
data-flow loop program into a strongly coordinated variant when possible.
For an arbitrary program P, we start by constructing a balanced variant
and testing it for validity. If P is balancable and valid, then its balanced
variant is coordinated with one of the two algorithms presented in this
section. Proposition 1 ensures that the resulting, strongly coordinated
system is a variant of P.

Starting with a balanced, valid variant, we construct a strongly coordinated
variant with the following algorithm.


Algorithm 1: Wave algorithm to coordinate loop data-flow programs

Input: A valid, balanced, loop program, $V_1, V_2, \ldots, V_m$

Output: A strongly coordinated variant of the given program, $V_1', V_2', \ldots, V_m'$

Method:

1. Form expressions $R_1, R_2, \ldots, R_m$ from the given expressions where
$R_i = \rho(V_i)(\{\})^*$.

2. Compute the data-flow execution sequences $C^1, C^2, \ldots, C^k$ and
$T^0, T^1, \ldots, T^k$ where k is the least integer for which $c_i^k > |\rho(V_i)|$ for all
i.

3. For each i and for $l = 1, 2, \ldots, k$, set $V_i'(l)$ to
$READS(i, l) \cup \{ w_j \mid r_i \in READS(j, l) \}$.

Theorem  4:  The  Wave  Algorithm  constructs  a  strongly  coordinated  variant 
of  any  valid,  balanced,  loop  program. 

Proof: Since the original system is valid, we are assured of finding a value
for k in step 2. By the construction in step 3, writes can only occur in the
same step as their corresponding reads so the system is strongly coordinated
(the complete justification of this appears in a paper on testing
coordination properties [4]). It remains to show that the constructed
system is a variant of the original system. Since the set of reads executed
in any cycle of a PE is the same in both the given and constructed systems,
requirements (i) and (iii) for a variant are trivially satisfied and,
for requirement (ii), it is sufficient to consider just the first k execution
steps of the system. For $l = 1, 2, \ldots, k$, it is obvious that
$READS(i, l) = READS'(i, l)$ for all i where READS is defined for the execution
sequences of the given system and READS' is defined for the execution
sequences of the constructed system. Suppose the second requirement
is violated by the l-th write from PE i to PE j which occurs on the r-th
step of the execution sequence for the constructed system and the s-th
step of the execution sequence for the given system where $r < s$. The write
in the constructed system occurs in the same step as its corresponding
read in both systems. Therefore in the original system the read that
corresponds to the write in step s must occur in step r (before s), which
is not possible by the definition of data-flow execution. //

If s is the total number of operations executed by PEs in a single cycle of
$V_1', V_2', \ldots, V_m'$, then $k \leq s$ and for all i, $|\rho(V_i')| = k$. The algorithm builds each
symbol of each $V_i'$ and so it requires $O(ms)$ time.
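
A compact way to see the three steps of the Wave Algorithm together is the sketch below. It is a minimal illustration under stated assumptions, not the authors' code: the function names `dataflow_reads` and `wave` are hypothetical, PEs are numbered from 0, operations are pairs `("r", j)` or `("w", j)`, and the input is one cycle of each PE of a valid, balanced system. The first function plays the role of steps 1 and 2 (one cycle per PE followed by no-ops, recording which reads execute on each step); the second applies step 3.

```python
def dataflow_reads(cycles):
    """Return reads_at, where reads_at[l][i] is the set of PEs that PE i
    reads from on step l+1 of the data-flow execution of one cycle."""
    m = len(cycles)
    pc = [0] * m
    fresh = [True] * m
    pending = [set() for _ in range(m)]   # sources of reads not yet executed
    t = [[0] * m for _ in range(m)]
    reads_at = []
    while any(pc[i] < len(cycles[i]) for i in range(m)):
        for i in range(m):
            if fresh[i] and pc[i] < len(cycles[i]):
                for kind, j in cycles[i][pc[i]]:
                    if kind == "w":
                        t[i][j] += 1
                    else:
                        t[j][i] -= 1
                        pending[i].add(j)
        done = []
        for i in range(m):
            # A pending read executes once its link has no unanswered read.
            ready = {j for j in pending[i] if t[j][i] >= 0}
            pending[i] -= ready
            done.append(ready)
        reads_at.append(done)
        advanced = False
        for i in range(m):
            fresh[i] = not pending[i]
            if fresh[i] and pc[i] < len(cycles[i]):
                pc[i] += 1
                advanced = True
        if not advanced:
            raise ValueError("deadlock: the input system is not valid")
    # Step k itself: every PE is past its cycle and executes only no-ops.
    reads_at.append([set() for _ in range(m)])
    return reads_at


def wave(cycles):
    """Step 3: place each write on the step where its matching read executes."""
    reads_at = dataflow_reads(cycles)
    m = len(cycles)
    out = [[set() for _ in reads_at] for _ in range(m)]
    for l, step in enumerate(reads_at):
        for i in range(m):
            for j in step[i]:
                out[i][l].add(("r", j))   # READS(i, l)
                out[j][l].add(("w", i))   # w_i added to PE j since r_j is in READS(i, l)
    return out
```

For a hypothetical three-stage pipeline in which PE 0 writes to PE 1, PE 1 reads and then writes to PE 2, and PE 2 reads, the sketch yields a three-step cycle whose last step is empty, because the loop index runs up to k as in step 3.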

Figure 4 is an example of a valid, data-flow system and its strongly
coordinated variant constructed by this algorithm. The name of the
algorithm comes from the fact that a single cycle’s data passes through the
entire system before any PE starts its next cycle. For this example, the
result is nearly optimal because the data dependencies of the program do
not allow any of the PEs to get more than a few operations ahead of the
remaining PEs. However, if the original system is changed even slightly,
as in Figure 5, the result is unsatisfactory. In this case, a better solution
is to allow Processors A and D to start a full cycle ahead of the others.
After they have completed their first cycles, Processor B can begin
executing its first cycle, while Processors A and D continue with their
second cycles. By the third cycle of A and D, all processors are executing
on every step. This more efficient solution, pictured in Figure 6,
maintains the original three-step cycle for the processors. The writes have
been moved "forward" so that, for example, the write which occurs at the
beginning of the second cycle for Processor A is delayed from its first
cycle. The wg in the third cycle of Processor D is delayed from its second
cycle and the wc in the third cycle of Processor D is delayed from its first
cycle. The solution was constructed by the following coordination
algorithm for systems with acyclic communication graphs.

Algorithm 2: Buffered Write Algorithm

Input: A valid, balanced, loop program, V_1, V_2, ..., V_m with an acyclic
communication graph.

Output: A strongly coordinated variant of the given system,
V_1', V_2', ..., V_m'.

Method: 

1.  Label  the  nodes  of  the  communication  graph  with  the  length  of  the 
longest  path  from  a  source  node  (a  node  with  no  predecessors)  to  the 
node.  Let  LMAX  be  the  depth  of  the  graph. 

2. Let n be the maximum length of any p(V_i) and form the expressions

R_1, R_2, ..., R_m where for all i

where l is the label of the node for PE_i and the expression E is p(V_i)
with all of the write operations removed (if writes are the only
operations on some step, replace them with ||).

3. For each i ∈ [m], for each k = 1 to |p(R_i)|, set V_i'(k) to

Ri{k)v  |  Wf  I 
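
Step 1 of the method can be sketched in Python as follows. This is only an
illustration under stated assumptions, not code from the original
algorithm: the function name longest_path_labels, the node-list and
edge-list representation of the communication graph, and the reading of
LMAX as the largest label are all assumptions made here.

```python
from collections import defaultdict, deque


def longest_path_labels(nodes, edges):
    """Label each node with the length of the longest path from a source.

    nodes: iterable of PE names (hypothetical representation).
    edges: iterable of (sender, receiver) pairs, one per communication link.
    Returns (labels, lmax), where lmax is taken to be the largest label
    (an assumed reading of "the depth of the graph").
    """
    nodes = list(nodes)
    successors = defaultdict(list)
    indegree = {v: 0 for v in nodes}
    for u, v in edges:
        successors[u].append(v)
        indegree[v] += 1

    # Source nodes (no predecessors) get label 0.
    labels = {v: 0 for v in nodes}
    ready = deque(v for v in nodes if indegree[v] == 0)
    visited = 0

    # Process nodes in topological order so that a node's label is final
    # before it is propagated to its successors.
    while ready:
        u = ready.popleft()
        visited += 1
        for v in successors[u]:
            labels[v] = max(labels[v], labels[u] + 1)
            indegree[v] -= 1
            if indegree[v] == 0:
                ready.append(v)

    if visited != len(nodes):
        raise ValueError("communication graph is not acyclic")
    return labels, max(labels.values(), default=0)
```

For a hypothetical graph with links A to B, D to B, and B to C (not taken
from the figures), the call longest_path_labels(["A", "B", "C", "D"],
[("A", "B"), ("D", "B"), ("B", "C")]) labels A and D with 0, B with 1, and
C with 2, so the sources would be the PEs that start earliest.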

Theorem 5: The Buffered Write Algorithm constructs a strongly
coordinated variant for any valid, balanced, loop program that has an
acyclic communication graph.

Proof: The fact that the resulting program is strongly coordinated follows
from the construction in step 3 as above, so it remains to show that it is a
variant of the given system. Conditions (i) and (iii) for variance are
obviously met, leaving condition (ii). The execution sequence for the
output system can be divided into periods equal in length to the cycle
size. Consider a single PE which both reads and writes (if a PE does not
both read and write, it trivially satisfies the second condition). After the
initial periods in which the PE just idles, it executes all of the reads from
one cycle during each period. Therefore, it is sufficient to show that if the
reads from a given iteration of the PE occur during period k, then the
writes for that iteration occur no sooner than period k + 1. Because the
communication graph is acyclic, this is easily done by induction on the
periods, noting that (1) the writes executed during any period are a
subset of the writes for a cycle and (2) the first read occurs at least one
period before the first write. //

The R_i will have a common length n equal to LMAX times l. In order to set
the value of each symbol in each one of the V_i's, all of the symbols in the
corresponding position of the s must be examined and so the algorithm
runs in O(m^2 n) time.

This algorithm works for all acyclic loop programs but it does not
always produce a good solution. Consider the system in Figure 7. The
Buffered Write construction creates a long initialization sequence (the
maximum length is the maximum cycle size times the number of PEs),
which means that many of the PEs idle for long times and that the length
of the PE code increases. The extra idling is probably not significant
since we can assume in most cases that the number of PEs will be much
smaller than the number of iterations required. The longer code,
however, is a more serious problem since PEs will normally have a very
limited amount of memory. For this example, a better solution is the
coordinated program in Figure 8 in which each of the writes has simply been
moved "forward" two steps. Because the movement was within one cycle,
the PEs do not have to stagger their starts. The Buffered Write Algorithm
can be modified to produce this code by "preprocessing" the
communication graph to eliminate links for which all writes appear before
their corresponding reads.

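The preprocessing idea can be sketched in Python as follows. This is a
minimal sketch under several assumptions that are not in the original
text: each PE's cycle is a list of steps, each step is a list of
(kind, link) pairs with kind "r", "w", or "c", the j-th write on a link is
matched with the j-th read on that link, and "before" is read as a
strictly smaller step index within the cycle. The names removable_links
and prune_graph are hypothetical.

```python
def removable_links(cycles, links):
    """Return the names of links whose writes all precede their reads.

    cycles: dict mapping a PE name to its cycle, a list of steps; each
            step is a list of (kind, link_name) pairs (assumed encoding).
    links:  dict mapping a link name to its (sender, receiver) pair.
    """
    removable = []
    for name, (sender, receiver) in links.items():
        # Step indices, within one cycle, of the writes and reads on this link.
        writes = [t for t, step in enumerate(cycles[sender]) if ("w", name) in step]
        reads = [t for t, step in enumerate(cycles[receiver]) if ("r", name) in step]
        # Pair the j-th write with the j-th read (assumption) and require
        # every write to occur on a strictly earlier step than its read.
        if len(writes) == len(reads) and all(w < r for w, r in zip(writes, reads)):
            removable.append(name)
    return removable


def prune_graph(links, removable):
    """Drop the removable links from the communication graph."""
    return {name: ends for name, ends in links.items() if name not in removable}
```

Under these assumptions, a link whose single write sits on step 0 of the
sender's cycle and whose read sits on step 2 of the receiver's cycle would
be dropped, while a link written on step 2 and read on step 0 would be
kept and would still force a staggered start.
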
As a final comment, notice that the compute operations have been
completely ignored in our analysis. To be realistic, we would have to
argue that our notion of variant preserves the computations of the
system. In fact, our definition does not preserve the computations of the
system since it does not preserve the order of compute steps or the
information available to a PE on a compute step. The definition of variant
would have to be strengthened. It should be noted, however, that our
algorithms could be easily adapted to this stronger definition since they
retain the position of all compute steps.

Figure 8. Strongly coordinated variant of the system in Figure 7.


Conclusions 

We have presented a simple model of parallel computation in which
both data-flow and synchronous execution modes can be harmoniously
expressed. Given certain programs defined using the data-flow execution
mode, we have shown how to synthesize programs that are computationally
equivalent when executed in the synchronous mode. For the class of
programs under consideration, we characterized those for which this
synthesis is possible using the concept of "balancable". Potentially, our
algorithms can be used to shift the burden of specifying detailed timing
behaviors from the programmer to a compiler.